It is important in any data science project to define the objective as specifically as possible. Below, the objective is stated from general to specific; this will direct the analysis.
Section 1: Initial Steps
Section 2: Data Cleaning and Preparation, Feature Engineering
Section 3: EDA of all variables and binning
Section 4: Models
The generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.
Generalized Linear Models (GLM) estimate regression models for outcomes following exponential distributions. In addition to the Gaussian (i.e. normal) distribution, these include Poisson, binomial, and gamma distributions. Each serves a different purpose, and depending on distribution and link function choice, can be used either for prediction or classification.
The GLM suite includes:
https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/glm.html
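To make the link-function idea concrete, here is a minimal NumPy sketch (my own illustration, not code from the H2O docs) of how the logit link ties a binomial mean to the linear predictor:

```python
import numpy as np

# Linear predictor eta = X @ beta, exactly as in ordinary linear regression
X = np.array([[1.0, 0.5],
              [1.0, 1.5],
              [1.0, 3.0]])
beta = np.array([-2.0, 1.0])
eta = X @ beta

# For a binomial GLM, the logit link maps the mean mu in (0, 1) to the
# unbounded linear predictor: eta = log(mu / (1 - mu)).
# Applying its inverse recovers the mean from eta:
mu = 1.0 / (1.0 + np.exp(-eta))
print(mu.round(3))  # each value is squashed into (0, 1)
```

Swapping the link (e.g. log for Poisson counts, identity for Gaussian) changes only this last mapping, which is what lets one framework cover regression and classification.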
Automated machine learning (AutoML) automates the application of machine learning to real-world problems, covering the complete pipeline from the raw dataset to the deployable machine learning model.
AutoML was proposed as an artificial intelligence-based solution to the ever-growing challenge of applying machine learning. Its high degree of automation allows non-experts to make use of machine learning models and techniques without first becoming experts in the field.
H2O’s AutoML can be used for automating the machine learning workflow, which includes automatic training and tuning of many models within a user-specified time-limit. Stacked Ensembles – one based on all previously trained models, another one on the best model of each family – will be automatically trained on collections of individual models to produce highly predictive ensemble models which, in most cases, will be the top performing models in the AutoML Leaderboard.
The H2O AutoML interface is designed to have as few parameters as possible so that all the user needs to do is point to their dataset, identify the response column and optionally specify a time constraint or limit on the number of total models trained.
https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html
H2O is a fully open source, distributed in-memory machine learning platform with linear scalability. H2O supports the most widely used statistical & machine learning algorithms including gradient boosted machines, generalized linear models, deep learning and more. H2O also has an industry leading AutoML functionality that automatically runs through all the algorithms and their hyperparameters to produce a leaderboard of the best models. The H2O platform is used by over 18,000 organizations globally and is extremely popular in both the R & Python communities.
# Import all packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import scipy
import time
import seaborn as sns
sns.set(style="whitegrid")
import warnings
warnings.filterwarnings("ignore")
from sklearn.impute import SimpleImputer
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import roc_curve, auc, roc_auc_score, accuracy_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_recall_curve
import plotly
import plotly.express as px
from imblearn.datasets import make_imbalance
import pylab as pl
from collections import Counter
# Read the data
df = pd.read_csv('/Users/harshdhanuka/Desktop/Columbia Class Matter/SEM 3/5420 Anomaly Detection/Assignment 2 EDA/XYZloan_default_selected_vars.csv')
df.head(2)
print("Number of rows and columns in the dataset:")
df.shape
# Check basic statistics
print("Basic statistics of the columns are as follows:")
df.describe()
AP006
df['AP006'].hist()
df['AP006'].value_counts()
loan_default
# Check the target variable column
print("The number of 0's and 1's are:")
print(df['loan_default'].value_counts())
df['loan_default'].hist()
#df.info()
The columns Unnamed: 0, Unnamed: 0.1, and id are index-like identifiers. They need to be dropped.
AP005 is a date-time column, which cannot be used directly for prediction. Because almost every record has a unique timestamp, the raw field behaves like an ID column and misrepresents the variable when making predictions, much like using the 'id' field in a decision tree. I will derive year, month, day, weekday, etc. from this field. In some models, 'year' may be kept as a variable to explain special volatility in the past, but the raw date-time field will never be used as a predictor.
TD025, TD026, TD027, TD028, CR012.
TD029, TD044, TD048, TD051, TD054, TD055, TD061, TD062.
AP002 - Gender
AP003 - Education Code
AP004 - Loan Term
AP006 - OS Type
AP007 - Application City Level
AP008 - Flag if City not Application City
AP009 - Binary format
MB007 - Mobile Brands/type
Convert AP005 to the relevant formats of Year, Month, Day
df['AP005'] = pd.to_datetime(df['AP005'])
# Create 4 new columns
df['Loan_app_day_name'] = df['AP005'].dt.day_name()
df['Loan_app_month'] = df['AP005'].dt.month_name()
df['Loan_app_time'] = df['AP005'].dt.time
df['Loan_app_day'] = df['AP005'].dt.day
# Drop old column
df = df.drop(columns = ['AP005'])
df.head(2)
# Cast the categorical columns to the 'object' dtype
cat_cols = ['AP002', 'AP003', 'AP004', 'AP006', 'AP007', 'AP008', 'AP009',
            'CR015', 'MB007', 'Loan_app_day_name', 'Loan_app_month',
            'Loan_app_time', 'Loan_app_day']
for col in cat_cols:
    df[col] = df[col].astype('object')
df = df.drop(columns = ['Unnamed: 0', 'Unnamed: 0.1', 'id', 'TD025', 'TD026', 'TD027', 'TD028', 'CR012','TD029', 'TD044', 'TD048', 'TD051', 'TD054', 'TD055', 'TD061', 'TD062'])
df.head(2)
According to the variable descriptions, all the following columns are counts, lengths, or days. Hence, negative values such as -999, -99, -98, -1, etc. are misrecorded NAs and need to be converted back to NaN.
features_nan = ['AP001',
'TD001', 'TD002', 'TD005', 'TD006', 'TD009', 'TD010',
'TD013', 'TD014', 'TD015', 'TD022', 'TD023', 'TD024', 'CR004', 'CR005',
'CR017', 'CR018', 'CR019', 'PA022', 'PA023', 'PA028',
'PA029', 'PA030', 'PA031', 'CD008', 'CD018', 'CD071', 'CD072', 'CD088',
'CD100', 'CD101', 'CD106', 'CD107', 'CD108', 'CD113', 'CD114', 'CD115',
'CD117', 'CD118', 'CD120', 'CD121', 'CD123', 'CD130', 'CD131', 'CD132',
'CD133', 'CD135', 'CD136', 'CD137', 'CD152', 'CD153', 'CD160', 'CD162',
'CD164', 'CD166', 'CD167', 'CD169', 'CD170', 'CD172', 'CD173', 'MB005']
# Define a function to convert negatives to NaN
def convert_to_nan(var):
    # Use .loc to avoid pandas' chained-assignment pitfall
    df.loc[df[var] < 0, var] = np.nan

for i in features_nan:
    convert_to_nan(i)
# Verify that the negatives are gone
print("The minimum now stands at 0 for most of the columns, verifying the mis-represented values are gone.")
df[features_nan].describe()
Multivariate imputer that estimates each feature from all the others. A strategy for imputing missing values by modeling each feature with missing values as a function of other features in a round-robin fashion.
The documentation is here: https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html
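As a hedged, self-contained sketch of the round-robin imputer on toy data (the numbers are illustrative only, not from the loan dataset):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy matrix with two missing cells; column 2 is roughly 2x column 1,
# so the imputer can exploit that relationship.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, np.nan],
              [np.nan, 8.0]])

imp = IterativeImputer(missing_values=np.nan, initial_strategy='median',
                       max_iter=10, random_state=0)
X_filled = imp.fit_transform(X)
print(X_filled.shape)  # same shape as X, with no NaNs remaining
```

Each feature with missing values is modeled as a function of the others, which is exactly the strategy applied to `df_2` below.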
from sklearn.experimental import enable_iterative_imputer # noqa
from sklearn.impute import IterativeImputer
df_2 = df[features_nan]
# Verify
df_2.head(3)
imp = IterativeImputer(missing_values=np.nan, sample_posterior=False,
max_iter=10, tol=0.001,
n_nearest_features=None, initial_strategy='median')
imp.fit(df_2)
imputed_data_median = pd.DataFrame(data=imp.transform(df_2),
                                   columns=features_nan,
                                   dtype='int')
imputed_data_median.head(3)
Convert CR009 to a category variable and bin appropriately
df['CR009'] = pd.cut(x=df['CR009'], bins=[-1, 100000, 200000, 300000, 400000, 500000, 600000, 700000, 800000, 900000, 1000000, 1500000])
df = df.astype({'CR009':'object'})
df.CR009.value_counts()
corr = df[['loan_default', 'AP001', 'TD001', 'TD002', 'TD005', 'TD006', 'TD009', 'TD010', 'TD013', 'TD014', 'TD015', 'TD022', 'TD023', 'TD024']].corr()
f,ax = plt.subplots(figsize=(18,12))
sns.heatmap(corr, annot=True, cmap='Greens', linewidths=.4, fmt= '.1f',ax=ax)
plt.show()
# Remove one feature from each pair with correlation above 0.7
# However, since H2O handles multicollinearity well, I will not remove these variables
corr_var_drop1 = ['TD005', 'TD022', 'TD006', 'TD009', 'TD013', 'TD023', 'TD010', 'TD014']
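The 0.7 rule above can also be applied programmatically. A minimal sketch, assuming a simple greedy pass that keeps the first feature of each correlated pair (the helper name and demo data are my own, not from the original analysis):

```python
import pandas as pd

def correlated_features(frame: pd.DataFrame, threshold: float = 0.7):
    """Return one member of each feature pair with |corr| > threshold."""
    corr = frame.corr().abs()
    cols = corr.columns
    to_drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                to_drop.add(cols[j])  # keep the first feature of the pair
    return sorted(to_drop)

demo = pd.DataFrame({'a': [1, 2, 3, 4],
                     'b': [2, 4, 6, 8],   # perfectly correlated with 'a'
                     'c': [4, 1, 3, 2]})
print(correlated_features(demo))  # → ['b']
```

A list produced this way could feed `df.drop(columns=...)` directly if one did decide to prune the correlated variables.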
I will keep the other variables, as they are all call-detail (CD) data.
filter_col = [col for col in df if col.startswith('CD')]
filter_col.append('loan_default')
corr = df[filter_col].corr()
f,ax = plt.subplots(figsize=(21,21))
sns.heatmap(corr, annot=True, cmap='Greens', linewidths=.4, fmt= '.1f',ax=ax)
plt.show()
# Remove one feature from each pair with correlation above 0.7
# However, since H2O handles multicollinearity well, I will not remove these variables
corr_var_drop2 = ['CD173', 'CD172', 'CD170', 'CD169', 'CD167', 'CD166', 'CD164', 'CD162',
'CD137', 'CD136', 'CD135', 'CD133', 'CD132', 'CD131', 'CD117', 'CD118',
'CD120', 'CD121', 'CD123', 'CD114', 'CD113', 'CD108', 'CD107', 'CD106',
'CD101', 'CD072']
df_bin = df.copy(deep = True)
df_bin.head(2)
# Write a function and loop through
def binning(var):
    df_bin[var + '_bin'] = pd.qcut(df_bin[var], 15, duplicates='drop').values.add_categories("NoData")
    df_bin[var + '_bin'] = df_bin[var + '_bin'].fillna("NoData").astype(str)
    df_bin[var + '_bin'].value_counts(dropna=False)
features = ['AP001', # 'AP002', 'AP003', 'AP004', 'AP006', 'AP007',
# 'AP008', 'AP009',
'TD001', 'TD002', 'TD005', 'TD006', 'TD009', 'TD010',
'TD013', 'TD014', 'TD015', 'TD022', 'TD023', 'TD024', 'CR004', 'CR005',
#'CR009', 'CR015',
'CR017', 'CR018', 'CR019', 'PA022', 'PA023', 'PA028',
'PA029', 'PA030', 'PA031', 'CD008', 'CD018', 'CD071', 'CD072', 'CD088',
'CD100', 'CD101', 'CD106', 'CD107', 'CD108', 'CD113', 'CD114', 'CD115',
'CD117', 'CD118', 'CD120', 'CD121', 'CD123', 'CD130', 'CD131', 'CD132',
'CD133', 'CD135', 'CD136', 'CD137', 'CD152', 'CD153', 'CD160', 'CD162',
'CD164', 'CD166', 'CD167', 'CD169', 'CD170', 'CD172', 'CD173', 'MB005'
# 'MB007', 'Loan_app_day_name', 'Loan_app_month', 'Loan_app_time',
# 'Loan_app_day'
]
for i in features:
    binning(i)
# View the bins of some variables
print(df_bin['TD001_bin'].value_counts(dropna=False))
print(df_bin['TD022_bin'].value_counts(dropna=False))
% Y by X for all the numerical columns
The 'mean' column represents the '% Y by X'.
def plot_X_and_Y(var):
    z = df_bin.groupby(var + '_bin')['loan_default'].agg(['count','mean']).reset_index()
    z['count_pcnt'] = z['count'] / z['count'].sum()
    x = z[var + '_bin']
    y_mean = z['mean']
    count_pcnt = z['count_pcnt']
    ind = np.arange(0, len(x))
    width = .5
    fig = plt.figure(figsize=(16,4))
    plt.subplot(121)
    plt.bar(ind, count_pcnt, width, color='r')
    #plt.ylabel('X')
    plt.title(var + ' Distribution')
    plt.xticks(ind, x.tolist(), rotation=45)
    plt.subplot(122)
    plt.bar(ind, y_mean, width, color='b')
    #plt.ylabel('Y by X')
    plt.xticks(ind, x.tolist(), rotation=45)
    plt.tight_layout()
    plt.title('Response mean by ' + var)
    plt.show()
#for i in features:
# plot_X_and_Y(i)
% Y by X for all the categorical columns
The 'mean' column represents the '% Y by X'.
features_2 = ['AP002', 'AP003', 'AP004', 'AP006', 'AP007', 'AP008', 'AP009',
'CR009','CR015', 'MB007', 'Loan_app_day_name', 'Loan_app_month',
'Loan_app_day'
]
def plot_X_and_Y_cat(var):
    z = df_bin.groupby(var)['loan_default'].agg(['count','mean']).reset_index()
    z['count_pcnt'] = z['count'] / z['count'].sum()
    x = z[var]
    y_mean = z['mean']
    count_pcnt = z['count_pcnt']
    ind = np.arange(0, len(x))
    width = .5
    fig = plt.figure(figsize=(16,4))
    plt.subplot(121)
    plt.bar(ind, count_pcnt, width, color='r')
    plt.ylabel('X')
    plt.title(var + ' Distribution')
    plt.xticks(ind, x.tolist(), rotation=45)
    plt.subplot(122)
    plt.bar(ind, y_mean, width, color='b')
    plt.ylabel('Y by X')
    plt.xticks(ind, x.tolist(), rotation=45)
    plt.tight_layout()
    plt.title('Response mean by ' + var)
    plt.show()
for i in features_2:
    plot_X_and_Y_cat(i)
From the above graphs, the following variables do not appear important, as they show no pattern, trend, or curve in the '% Y by X' graph:
Loan_app_day_name
df_count = df['AP006'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['AP006 - OS Type','Count']
print(df_count.head())
fig = px.bar(df_count, x = 'AP006 - OS Type', y = 'Count', color = 'AP006 - OS Type',
width=600, height=400,
title = "Distribution of OS type")
fig.show()
df_count = df['AP002'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['AP002 - Gender','Count']
print(df_count.head())
fig = px.bar(df_count, x = 'AP002 - Gender', y = 'Count', color = 'AP002 - Gender',
width=600, height=400,
title = "Distribution of Gender")
fig.show()
df_count = df['AP003'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['AP003 - Education','Count']
print(df_count.head())
fig = px.bar(df_count, x = 'AP003 - Education', y = 'Count', color = 'AP003 - Education',
width=600, height=400,
title = "Distribution of Education")
fig.show()
fig = px.box(df, x="TD001",width=1000, height=500,
title = "Distribution of TD001 - TD_CNT_QUERY_LAST_7Day_P2P")
fig.show()
fig = px.box(df, x="MB005",width=1000, height=500,
title = "Distribution of MB005")
fig.show()
fig = px.box(df, x="AP007", y="TD001",width=900, height=400,
color = "AP002",
title = "The Distribution of Level Application City by TD_CNT_QUERY_LAST_7Day_P2P")
fig.show()
fig = sns.pairplot(df[['AP002', 'AP003', 'AP004']],
hue= 'AP004')
fig
# Overwrite the NA-value columns with the previously imputed values
df[features_nan] = imputed_data_median
df.head(2)
df.isnull().sum().sum()
import h2o
h2o.init()
I will use a 75:25 split for train and test.
# Split the data
train,test = train_test_split(df,test_size = 0.25, random_state = 1234)
# Convert to a h2o dataframe for computation
df_hex = h2o.H2OFrame(df)
train_hex = h2o.H2OFrame(train)
test_hex = h2o.H2OFrame(test)
# This test_hex will be used throughout the models, for prediction
df.loan_default.value_counts()
target = 'loan_default'
# I will make both the classes equal
y = df[target]
X = df.drop(target,axis=1)
y.dtypes
from imblearn.over_sampling import RandomOverSampler
sampler = RandomOverSampler(sampling_strategy={1: 64512, 0: 64512},random_state=0)
X_rs, y_rs = sampler.fit_resample(X, y)  # fit_sample was renamed to fit_resample in imbalanced-learn
print('RandomOverSampler {}'.format(Counter(y_rs)))
X_rs = pd.DataFrame(X_rs)
y_rs = pd.DataFrame(y_rs)
smpl = pd.concat([X_rs,y_rs], axis = 1)
train,test = train_test_split(smpl,test_size = 0.25, random_state = 1234)
# Convert to a h2o dataframe for computation
train_hex = h2o.H2OFrame(train)
# Selecting all independent variables as predictor variables
predictors = ['AP001', 'AP002', 'AP003', 'AP004', 'AP006', 'AP007',
'AP008', 'AP009', 'TD001', 'TD002', 'TD005', 'TD006', 'TD009', 'TD010',
'TD013', 'TD014', 'TD015', 'TD022', 'TD023', 'TD024', 'CR004', 'CR005',
'CR009', 'CR015', 'CR017', 'CR018', 'CR019', 'PA022', 'PA023', 'PA028',
'PA029', 'PA030', 'PA031', 'CD008', 'CD018', 'CD071', 'CD072', 'CD088',
'CD100', 'CD101', 'CD106', 'CD107', 'CD108', 'CD113', 'CD114', 'CD115',
'CD117', 'CD118', 'CD120', 'CD121', 'CD123', 'CD130', 'CD131', 'CD132',
'CD133', 'CD135', 'CD136', 'CD137', 'CD152', 'CD153', 'CD160', 'CD162',
'CD164', 'CD166', 'CD167', 'CD169', 'CD170', 'CD172', 'CD173', 'MB005',
'MB007', 'Loan_app_day_name', 'Loan_app_month', 'Loan_app_time',
'Loan_app_day']
target = 'loan_default'
len(predictors)
len(df.columns.to_list())
H2O supports two types of grid search – traditional (or “cartesian”) grid search and random grid search. In a cartesian grid search, users specify a set of values for each hyperparameter that they want to search over, and H2O will train a model for every combination of the hyperparameter values. This means that if you have three hyperparameters and you specify 5, 10 and 2 values for each, your grid will contain a total of 5*10*2 = 100 models.
In random grid search, the user specifies the hyperparameter space in the exact same way, except H2O will sample uniformly from the set of all possible hyperparameter value combinations. In random grid search, the user also specifies a stopping criterion, which controls when the random grid search is completed. The user can tell the random grid search to stop by specifying a maximum number of models or the maximum number of seconds allowed for the search. The user may also specify a performance-metric-based stopping criterion, which will stop the random grid search when the performance stops improving by a specified amount.
Once the grid search is complete, the user can query the grid object and sort the models by a particular performance metric (for example, “AUC”). All models are stored in the H2O cluster and are accessible by model id.
Examples of how to perform cartesian and random grid search in all of H2O’s APIs follow below. There are also longer grid search tutorials available for R and Python.
https://docs.h2o.ai/h2o/latest-stable/h2o-docs/grid-search.html
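The cartesian model count can be sketched with a plain Python cross-product (the hyperparameter names below are illustrative, not tied to any specific H2O algorithm):

```python
from itertools import product

# Illustrative hyperparameter grid: 5 x 10 x 2 value choices
grid = {
    'ntrees': [50, 100, 150, 200, 250],
    'max_depth': list(range(1, 11)),
    'learn_rate': [0.01, 0.1],
}

# A cartesian search trains one model per combination
combos = list(product(*grid.values()))
print(len(combos))  # 5 * 10 * 2 = 100 models
```

A random search, by contrast, samples from this same product space until a stopping criterion (model count, wall-clock time, or a stalled metric) is met.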
The overall strategy is to test lambda and alpha. The hyperparameters for tuning are the following:
GLM Hyperparameters
Here, I will tune only lambda and alpha.
https://docs.h2o.ai/h2o/latest-stable/h2o-docs/grid-search.html
from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
# select the values for lambda_ to grid over
hyper_params = {'lambda': [1, 0.5, 0.03, 0.02, 0.1, 0.01, 0.05, 0.08, 0.001, 0.003, 0.005, 0.0001, 0.0005, 0.00001, 0],
'alpha': [1, 0.5, 0]}
# initialize the glm estimator
glm_grid = H2OGeneralizedLinearEstimator(family = 'fractionalbinomial')
# build grid search with previously made GLM and hyperparameters
grid = H2OGridSearch(model = glm_grid,
hyper_params = hyper_params,
search_criteria = {'strategy': "Cartesian"})
# train using the grid
grid.train(x = predictors, y = target, training_frame = train_hex, validation_frame = test_hex)
# sort the grid models by increasing 'rmse'
grid_table1 = grid.get_grid(sort_by = 'rmse', decreasing = False)
print('')
print("The grid search models sorted by increasing 'rmse' are as follows: ")
print('')
print(grid_table1)
grid_sorted2 = grid.get_grid(sort_by='r2',decreasing=True)
# print(grid_sorted2)
best_glm = grid_sorted2.models[0]
print(best_glm)
As per the R² and RMSE values displayed above, the optimal parameters for the GLM model are:
I will now build the GLM Models based on these optimum hyper parameters.
The lambda value will be left at 0.
GLM_WO = H2OGeneralizedLinearEstimator(family= "fractionalbinomial",
lambda_ = 0, # No regularization
compute_p_values = True)
GLM_WO.train(predictors, target, training_frame= train_hex, validation_frame = test_hex)
var_imps = pd.DataFrame(GLM_WO.varimp(), columns = ['Variable', 'Relative_Importance',
'Scaled_Importance', 'Percentage'])
The model has 75 features in total. After running the initial model with all features, I ran many different combinations to find the best lift score. The lift is best with all features in the model, so I will not perform variable-importance selection.
#predictors = var_imps['Variable'].head(30).to_list()
# target = 'loan_default'
# GLM_WO = H2OGeneralizedLinearEstimator(family= "fractionalbinomial",
# lambda_ = 0, # No regularization
# compute_p_values = True)
# GLM_WO.train(predictors, target, training_frame= train_hex, validation_frame = test_hex)
var_imps = pd.DataFrame(GLM_WO.varimp(), columns = ['Variable', 'Relative_Importance',
'Scaled_Importance', 'Percentage'])
print()
print("The 10 most important features are: ")
print()
var_imps.head(10)
GLM_WO.varimp_plot()
Use the original test_hex set for prediction
y_pred = GLM_WO.predict(test_hex).as_data_frame()
y_actual = test_hex[target].as_data_frame()
y_pred.head()
A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The method was developed for operators of military radar receivers, which is why it is so named.
GLM_WO_actual_predict = pd.concat([y_actual,y_pred['predict']],axis=1)
GLM_WO_actual_predict.columns = ['actual','pred']
print(GLM_WO_actual_predict.head())
GLM_WO_roc_auc_value = roc_auc_score(GLM_WO_actual_predict['actual'],GLM_WO_actual_predict['pred'])
print('\n' + 'The AUC is: ')
GLM_WO_roc_auc_value
def gains_table(df_actual_predict):
    df_actual_predict = df_actual_predict.sort_values(by='pred', ascending=False)
    df_actual_predict['row_id'] = range(len(df_actual_predict))
    df_actual_predict['decile'] = (df_actual_predict['row_id'] / (len(df_actual_predict)/10)).astype(int)
    # Clip the edge case where rounding puts the last row into decile 10
    df_actual_predict.loc[df_actual_predict['decile'] == 10, 'decile'] = 9
    # Create gains table
    gains = df_actual_predict.groupby('decile')['actual'].agg(['count','sum'])
    gains.columns = ['count','actual']
    gains['non_actual'] = gains['count'] - gains['actual']
    gains['cum_count'] = gains['count'].cumsum()
    gains['cum_actual'] = gains['actual'].cumsum()
    gains['cum_non_actual'] = gains['non_actual'].cumsum()
    gains['percent_cum_actual'] = (gains['cum_actual'] / np.max(gains['cum_actual'])).round(2)
    gains['percent_cum_non_actual'] = (gains['cum_non_actual'] / np.max(gains['cum_non_actual'])).round(2)
    gains['if_random'] = np.max(gains['cum_actual']) / 10
    gains['if_random'] = gains['if_random'].cumsum()
    gains['lift'] = (gains['cum_actual'] / gains['if_random']).round(2)
    gains['K_S'] = np.abs(gains['percent_cum_actual'] - gains['percent_cum_non_actual']) * 100
    gains['gain'] = (gains['cum_actual'] / gains['cum_count'] * 100).round(2)
    return gains
GLM_gains = gains_table(GLM_WO_actual_predict)
GLM_gains
def ROC_PR(df_actual_predict):
    print('')
    print(' * ROC curve: The ROC curve plots the true positive rate vs. the false positive rate')
    print('')
    print(' * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring how well the model ranks positives above negatives')
    print('')
    print(' * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)')
    print('')
    # ROC
    roc_auc_value = roc_auc_score(df_actual_predict['actual'], df_actual_predict['pred'])
    fpr, tpr, _ = roc_curve(df_actual_predict['actual'], df_actual_predict['pred'])
    roc_auc = auc(fpr, tpr)
    lw = 2
    plt.figure(figsize=(10,4))
    plt.subplot(1,2,1)
    plt.plot(fpr, tpr, color='darkorange', lw=lw, label='AUC = %0.4f' % roc_auc_value)
    plt.plot([0,1], [0,1], color='navy', lw=lw, linestyle='--')
    plt.xlim([0,1])
    plt.ylim([0,1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve: AUC={0:0.4f}'.format(roc_auc_value))
    plt.legend(loc='lower right')
    # Precision-Recall
    plt.subplot(1,2,2)
    average_precision = average_precision_score(df_actual_predict['actual'], df_actual_predict['pred'])
    precision, recall, _ = precision_recall_curve(df_actual_predict['actual'], df_actual_predict['pred'])
    plt.step(recall, precision, color='b', alpha=0.2, where='post')
    plt.fill_between(recall, precision, step='post', alpha=0.2, color='b')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.title('Precision-Recall curve: PR={0:0.4f}'.format(average_precision))
ROC_PR(GLM_WO_actual_predict)
I will use the optimal parameters given by the Grid Search function.
GLM_WITH = H2OGeneralizedLinearEstimator(family = "AUTO",
lambda_ = 0,
lambda_search = True,
alpha = 1,
seed = 1234,
nfolds = 10,
stopping_rounds = 0,
standardize = True)
GLM_WITH.train(predictors, target, training_frame= train_hex)
var_imps = pd.DataFrame(GLM_WITH.varimp(), columns = ['Variable', 'Relative_Importance',
                                                      'Scaled_Importance', 'Percentage'])
The model has 75 features in total. After running the initial model with all features, I ran many different combinations to find the best lift score. The lift is best with all features in the model, so I will not perform variable-importance selection.
#predictors = var_imps['Variable'].head(30).to_list()
# target = 'loan_default'
# GLM_WITH = H2OGeneralizedLinearEstimator(family = "AUTO",
# lambda_ = 0,
# lambda_search = True,
# alpha = 1,
# seed = 1234,
# nfolds = 10,
# stopping_rounds = 0,
# standardize = True)
# GLM_WITH.train(predictors, target, training_frame= train_hex)
var_imps = pd.DataFrame(GLM_WITH.varimp(), columns = ['Variable', 'Relative_Importance',
                                                      'Scaled_Importance', 'Percentage'])
print()
print("The 10 most important features are: ")
print()
var_imps.head(10)
GLM_WITH.varimp_plot(num_of_features = 30)
Use the original test_hex set for prediction
y_pred = GLM_WITH.predict(test_hex).as_data_frame()
y_actual = test_hex[target].as_data_frame()
y_pred.head()
GLM_WITH_actual_predict = pd.concat([y_actual,y_pred['predict']],axis=1)
GLM_WITH_actual_predict.columns = ['actual','pred']
GLM_WITH_actual_predict.head()
GLM_WITH_roc_auc_value = roc_auc_score(GLM_WITH_actual_predict['actual'],GLM_WITH_actual_predict['pred'])
GLM_WITH_roc_auc_value
GLM_gains = gains_table(GLM_WITH_actual_predict)
GLM_gains
Check the variable coefficients and standardized coefficients
coefs = GLM_WITH._model_json['output']['coefficients_table'].as_data_frame()
coefs = pd.DataFrame(coefs)
coefs.sort_values(by='standardized_coefficients',ascending=False)
ROC_PR(GLM_WITH_actual_predict)
Among the GLM models, the one with regularization gives a satisfactory result.
In our dataset, the minority class represents 19.3% (around 1/5th) of the total data, and the majority class 80.7% (around 4/5ths), i.e. about 4 times the minority class. The underlying assumption for the stable results is that this is not a bad class distribution; 20% of the data in one class is fine for building a stable model. If the majority class were more than 5-6 times the minority class, that could be handled by the balance_classes parameter in the model. However, to develop a better model, I chose to manually oversample the minority class.
I also observed that for this problem, the balance_classes parameter did not make any difference.
Also, for the train-test split, I found that a 75:25 split works best for all my models. I tried other combinations, but a 75% train set gives the best lift score of 2.17, so I will continue with this split ratio.
The optimum hyper-parameters for the GLM model built above are:
Please see below for the meaningful business insights:
H2O package is a very effective and efficient package to build a machine learning model for predicting loan default. Also, H2O package is very handy to display the variable importance, handle correlations, and also dummy code the categorical variables.
Gains table and Lift:
For the final model, built after tuning each model over various values of each parameter, the highest lift score I obtained is 2.17, which is good by industry standards. A lift score above 2 is generally considered acceptable.
ROC and AUC: The area under the ROC curve (AUC) assesses overall classification performance. However, AUC does not place more emphasis on one class over the other, so it does not reflect the minority class well. Precision-Recall (PR) curves are more informative than ROC when dealing with highly skewed datasets. The PR curve plots precision vs. recall (the true positive rate). Because precision is directly influenced by class imbalance, precision-recall curves better highlight differences between models on highly imbalanced datasets.
However, neither metric represents the results perfectly: one does not reflect the minority class well, and the other is sensitive to imbalanced data. Hence, we use the H2O package, which helps handle these problems for us. We get an AUC of 0.69 and an average precision of 0.35, which is acceptable, but I will try to improve them further in my following models.
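As a hedged illustration on synthetic data (not the loan dataset), the gap between ROC AUC and average precision under class imbalance can be reproduced directly with scikit-learn:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n = 10000
# Highly imbalanced labels: roughly 2% positives
y = (rng.random(n) < 0.02).astype(int)
# A weakly informative score: positives shifted slightly upward
scores = rng.normal(size=n) + 1.0 * y

print(round(roc_auc_score(y, scores), 3))
print(round(average_precision_score(y, scores), 3))
# ROC AUC looks healthy while average precision stays low,
# reflecting how few positives there are.
```

This is why the PR value of 0.35 above should be read against the ~19% base rate of defaults rather than against 1.0.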
The major outcome of this exercise is that GLM is a good model for predicting loan default on the given data. However, we should not overlook other strong models such as XGBoost, random forests, or AutoML; these might provide better results as well.
Further, as per this model, the most important variables are:
from h2o.automl import H2OAutoML
# Split the data
train,test = train_test_split(df,test_size = 0.5, random_state = 1234)
# Convert to a h2o dataframe for computation
df_hex = h2o.H2OFrame(df)
train_hex = h2o.H2OFrame(train)
test_hex = h2o.H2OFrame(test)
# This test_hex will be used throughout the models, for prediction
# Selecting all predictor variables
predictors = ['AP001', 'AP002', 'AP003', 'AP004', 'AP006', 'AP007',
'AP008', 'AP009', 'TD001', 'TD002', 'TD005', 'TD006', 'TD009', 'TD010',
'TD013', 'TD014', 'TD015', 'TD022', 'TD023', 'TD024', 'CR004', 'CR005',
'CR009', 'CR015', 'CR017', 'CR018', 'CR019', 'PA022', 'PA023', 'PA028',
'PA029', 'PA030', 'PA031', 'CD008', 'CD018', 'CD071', 'CD072', 'CD088',
'CD100', 'CD101', 'CD106', 'CD107', 'CD108', 'CD113', 'CD114', 'CD115',
'CD117', 'CD118', 'CD120', 'CD121', 'CD123', 'CD130', 'CD131', 'CD132',
'CD133', 'CD135', 'CD136', 'CD137', 'CD152', 'CD153', 'CD160', 'CD162',
'CD164', 'CD166', 'CD167', 'CD169', 'CD170', 'CD172', 'CD173', 'MB005',
'MB007', 'Loan_app_day_name', 'Loan_app_month', 'Loan_app_time',
'Loan_app_day']
target = 'loan_default'
len(predictors)
len(df.columns.to_list())
AutoML performs a hyperparameter search over a variety of H2O algorithms in order to deliver the best model. The table at the link below lists the hyperparameters, along with all potential values that can be randomly chosen in the search. If a model also has a non-default value set for a hyperparameter, it is identified in the list as well. Random Forest and Extremely Randomized Trees are not grid searched (in the current version of AutoML), so they are not included in the list.
AutoML does not run a grid search for GLM. Instead, AutoML builds a single model with lambda_search enabled and passes a list of alpha values. It returns only the model with the best alpha-lambda combination rather than one model for each alpha.
https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html
The max_runtime_secs argument provides a way to limit the AutoML run by time. When using a time-limited stopping criterion, the number of models trained will vary between runs. If different hardware is used, or even if the same machine is used but the available compute resources are not the same between runs, then AutoML may be able to train more models on one run than on another.
When a test frame is passed explicitly to the leaderboard_frame argument, the leaderboard is generated from test-set metrics instead of cross-validated metrics.
### No longer supported in the latest version of H2O AutoML; it now runs by default.
# select the values for parameters to grid over
# hyper_params = {'max_runtime_secs': [100, 300, 500, 800, 1000, 1500, 1800, 2000],
# 'max_models': [10, 20, 50, 100, 150, 200, 250, 300, 500]}
# initialize the AutoML estimator
# aml_grid = H2OAutoML()
# build grid search with previously made GLM and hyperparameters
# grid = H2OGridSearch(aml_grid,
# hyper_params = hyper_params)
# train using the grid
# grid.train(x = predictors, y = target, training_frame = train_hex, validation_frame = test_hex)
I will manually specify the basic parameters required to run the model.
The current version of AutoML trains and cross-validates the following algorithms (in the following order): three pre-specified XGBoost GBM (Gradient Boosting Machine) models, a fixed grid of GLMs, a default Random Forest (DRF), five pre-specified H2O GBMs, a near-default Deep Neural Net, an Extremely Randomized Forest (XRT), a random grid of XGBoost GBMs, a random grid of H2O GBMs, and a random grid of Deep Neural Nets. In some cases, there will not be enough time to complete all the algorithms, so some may be missing from the leaderboard. AutoML then trains two Stacked Ensemble models (more info about the ensembles below). Particular algorithms (or groups of algorithms) can be switched off using the exclude_algos argument.
aml_1 = H2OAutoML(project_name = 'aml_1',
max_runtime_secs = 1800, # 30 minutes
max_models = 200, # train up to 200 different models
nfolds = 5, # 5 fold cross validation for each model
stopping_rounds = 0,
stopping_tolerance = 0.005,
balance_classes = False, # no over/under-sampling; only applies to classification
seed = 1234)
aml_1.train(x = predictors,
y = target,
training_frame = train_hex)
We will now view the AutoML model leaderboard. If a leaderboard_frame is specified in the H2OAutoML.train() call, the leaderboard ranks models by their performance on that frame; otherwise, as here, cross-validated metrics are used.
The AutoML object includes a “leaderboard” of models that were trained in the process, including the 5-fold cross-validated model performance (by default). The number of folds used in the model evaluation process can be adjusted using the nfolds parameter. If you would like to score the models on a specific dataset, you can specify the leaderboard_frame argument in the AutoML run, and then the leaderboard will show scores on that dataset instead.
A default performance metric for each machine learning task (binary classification, multiclass classification, regression) is specified internally and the leaderboard will be sorted by that metric. In the case of regression, the default ranking metric is mean residual deviance. In the future, the user will be able to specify any of the H2O metrics so that different metrics can be used to generate rankings on the leaderboard.
lb = aml_1.leaderboard.head()
lb
aml_1.leader
I tried feature selection based on variable importance, but it degraded the model's performance, so I will use all the features.
# Check for tree models to grab variable importance
lb[:5,"model_id"]
# Stacked Ensemble models do not report variable importance, so grab the third model, which is a GBM
m = h2o.get_model(lb[2,"model_id"])
m
print('The top 10 most important variables are: ')
m.varimp(use_pandas=True)[:10]
m.varimp_plot(num_of_features = 30)
The following plots are for illustrative purposes:
https://docs.h2o.ai/h2o/latest-stable/h2o-docs/explain.html
Heatmap

Correlation Map

SHAP Plot

Individual Conditional Expectation (ICE)

Use the original test_hex set for prediction
def actual_predict(model, test_hex, target):
    y_pred = model.predict(test_hex).as_data_frame()
    y_actual = test_hex[target].as_data_frame()
    df_actual_predict = pd.concat([y_actual, y_pred], axis=1)
    df_actual_predict.columns = ['actual', 'pred']
    return df_actual_predict
y_pred = aml_1.predict(test_hex).as_data_frame()
y_actual = test_hex[target].as_data_frame()
y_pred.head()
AML_actual_predict = actual_predict(aml_1,test_hex,target)
AML_actual_predict.head()
pred = aml_1.predict(test_hex)
pred.head()
perf = aml_1.leader.model_performance(test_hex)
perf
A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The method was developed for operators of military radar receivers, which is why it is so named.
dd = AML_actual_predict
AML_roc_auc_value = roc_auc_score(dd['actual'],dd['pred'])
AML_roc_auc_value
AML_gains = gains_table(AML_actual_predict)
AML_gains
ROC_PR(AML_actual_predict)
The AutoML model's results are very satisfactory.
In our dataset, the minority class represents 19.3% (about 1/5) of the total data and the majority class 80.7% (about 4/5), i.e. roughly 4 times the minority class. The underlying assumption for the stable results is that this is not a bad class distribution: 20% of the data in one class is fine for building a stable model. If the majority class were more than 5-6 times the size of the minority class, that would need to be addressed by balancing the classes.
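A quick arithmetic check of that reasoning (the 5.5 threshold below is just the midpoint of the 5-6x rule of thumb quoted above):

```python
# Check the imbalance reasoning: with ~19.3% positives the majority class
# is roughly 4.2x the minority, under the ~5-6x threshold suggested above.
minority_share = 0.193
majority_share = 1 - minority_share           # 0.807, i.e. 80.7%
ratio = majority_share / minority_share
print(f"majority/minority ratio = {ratio:.2f}")  # ~4.18
needs_balancing = ratio > 5.5                 # illustrative threshold
print(needs_balancing)                        # False
```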
Also, for the train/test split, I found that a 50:50 split works best for all my models. I tried other combinations, but a 50% training share gives the best lift score of 2.18 in both models, so I will continue with this split ratio.
The Best model given was a Stacked Ensemble Model, with the following output:
The Auto ML Model parameters used are:
Please see below for the meaningful business insights:
The H2O package is a very effective and efficient way to build a machine learning model for predicting loan default. It is also very handy for displaying variable importance, handling correlations, and dummy-coding the categorical variables.
Gains table and Lift:
For the final model, built after tuning each model over various values of each parameter and then tuning all the hyperparameters together, the highest lift score I obtained is 2.18, which is good by industry standards; a lift score of around 2 is generally considered acceptable.
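The notebook's own `gains_table` helper is defined in an earlier section; for readers following along, here is a minimal stand-alone decile gains/lift sketch on toy data (the synthetic scores are illustrative only):

```python
# Minimal sketch of a decile-based gains/lift computation.
import numpy as np
import pandas as pd

def lift_table(actual, pred, bins=10):
    df = pd.DataFrame({"actual": actual, "pred": pred})
    # Rank by predicted score; decile 1 holds the highest scores
    df["decile"] = pd.qcut(df["pred"].rank(method="first", ascending=False),
                           bins, labels=False) + 1
    base_rate = df["actual"].mean()
    out = df.groupby("decile")["actual"].agg(["count", "mean"])
    out["lift"] = out["mean"] / base_rate   # lift vs. the overall default rate
    return out

rng = np.random.default_rng(1234)
y = (rng.random(2000) < 0.2).astype(int)    # ~20% defaults, as in the data
p = 0.5 * y + 0.5 * rng.random(2000)        # informative toy scores
tbl = lift_table(y, p)
print(tbl)
```

With a well-ranking model the top decile's lift is well above 1 and the bottom decile's is near 0, which is exactly what a gains table summarizes.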
ROC and AUC vs. Precision-Recall: as discussed above, AUC does not reflect the minority class well, while PR curves are more informative on skewed data. The final model achieves an AUC of 0.7 and a PR AUC of 0.35, which is acceptable, though I will try to improve on them further.
The best lift is 2.18
As per my model-building analysis, the H2O Random Forest (with oversampling) gave the best lift score of 3.01, as seen in the previous week's submission.
For this week, comparing GLM and AutoML after fine-tuning all parameters, I find that AutoML gives a slightly better result, with the highest lift score of 2.18.
AutoML is a convenient framework that builds many different models (RF, GBM, GLM, etc.). However, I also suggest fine-tuning Random Forest or GBM separately for predicting loan defaults; other models, such as XGBoost, might also be worth exploring.